SPLIT: Smart Preprocessing (Quasi) Language Independent Tool

Authors

  • Mohamed Al-Badrashiny
  • Arfath Pasha
  • Mona T. Diab
  • Nizar Habash
  • Owen Rambow
  • Wael Salloum
  • Ramy Eskander
Abstract

Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as far as feasible, are among the goals of our scientific enterprise, so building systems that ensure consistency across our various pipelines would contribute significantly to those goals. The problem has become quite pronounced as more and more NLP tools become available, each with a different level of specification. In this paper, we present SPLIT, a dynamic, unified preprocessing framework and tool that is highly configurable based on user requirements and serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps through a unified API that researchers can exchange to ensure complete transparency in replication. The user can select the required preprocessing tasks from a long list of preprocessing steps and can also specify their order of execution, which in turn affects the final preprocessing output.
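The point that execution order changes the output can be illustrated with a minimal sketch. This is not SPLIT's API, which the abstract does not describe; the step names (`lowercase`, `mask_urls`, `collapse_whitespace`), the `STEPS` registry, and `run_pipeline` are all hypothetical names chosen for illustration, assuming each step is a plain string-to-string function.

```python
import re
from typing import Callable, Dict, List

# Hypothetical step type: every preprocessing step maps text to text.
Step = Callable[[str], str]


def lowercase(text: str) -> str:
    """Lowercase the whole input."""
    return text.lower()


def mask_urls(text: str) -> str:
    """Replace http(s) URLs with the placeholder token 'URL'."""
    return re.sub(r"https?://\S+", "URL", text)


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


# Hypothetical registry from which the user selects steps by name.
STEPS: Dict[str, Step] = {
    "lowercase": lowercase,
    "mask_urls": mask_urls,
    "collapse_whitespace": collapse_whitespace,
}


def run_pipeline(text: str, step_names: List[str]) -> str:
    """Apply the selected steps in exactly the order the user listed them."""
    for name in step_names:
        if name not in STEPS:
            raise ValueError(f"unknown preprocessing step: {name}")
        text = STEPS[name](text)
    return text


if __name__ == "__main__":
    sample = "See http://Example.com NOW"
    # Masking first means the placeholder itself gets lowercased.
    print(run_pipeline(sample, ["mask_urls", "lowercase"]))  # see url now
    # Lowercasing first leaves the placeholder in upper case.
    print(run_pipeline(sample, ["lowercase", "mask_urls"]))  # see URL now
```

The same two steps give different outputs depending only on their order, so a shared configuration needs to record the ordered list of steps, not just the set of steps selected.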

Similar resources

Detecting and Preventing Power Outages in a Smart Grid using eMoflon

We present a solution to the Outage System Case of the Transformation Tool Contest 2017, based on the bidirectional model transformation language eMoflon. The case comprises two tasks, in which the goal is to produce custom model views on a set of input models from a smartgrid system, eventually allowing power outages to be detected and prevented. To facilitate understandability, our solution u...

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

The main task of the tokenization is to divide the sentences of the text into its constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level and the extracted units can be used as input to other components such as stemmer. The requirement to create...

Normalization and Compilation of Deductive and Object-Oriented Databases Programs for Efficient Query Evaluation

A normalization process is proposed to serve not only as a preprocessing stage for compilation and evaluation but also as a tool for classifying recursions. Then the query-independent compilation and chain-based evaluation method can be extended naturally to process a class of DOOD programs and queries. The query-independent compilation captures the bindings that could be difficult to be captured...

magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian

Hungarian is the stereotype of morphologically rich and free word order languages. Here, we introduce magyarlanc, a natural language toolkit developed for the linguistic preprocessing – segmentation, morphological analysis, POS-tagging and dependency parsing – of Hungarian texts. We hope that the free availability of the toolkit fosters the research not just on the Hungarian language but on all...

The Effect of Smart Board Technology on Iranian EFL Learners’ Achievement Motivation and Willingness to Communicate

This study aimed at investigating the effect of using smart board technology on the EFL learners' achievement motivation and willingness to communicate (WTC). The participants were 65 second grade female students from Shahid Nazari girls’ high school in Andimeshk, Iran, who were selected randomly. An OPT was administered to the participants to homogenize them. Other instruments were Herman...


Publication date: 2016